Abstract: Stream mining poses unique challenges to machine
learning: predictive models must be scalable and incrementally
trainable, must remain bounded in size (even when the data
stream is arbitrarily long), and must be nonparametric in order
to achieve high accuracy even in complex and dynamic
environments. Moreover, the learning system must be
parameterless (traditional tuning methods are problematic in
streaming settings) and avoid requiring prior knowledge of
the number of distinct class labels occurring in the stream.
In this paper, we introduce a new algorithmic approach for
nonparametric learning in data streams. Our approach addresses
all of the above-mentioned challenges by learning a model that
covers the input space using simple local classifiers. The
distribution of these classifiers dynamically adapts to the local
(unknown) complexity of the classification problem, thus
achieving a good balance between model complexity and predictive
accuracy. We design four variants of our approach with
increasing adaptivity. By means of an extensive empirical
evaluation against standard nonparametric baselines, we show
state-of-the-art results in terms of accuracy versus model size.
For the variant that imposes a strict bound on the model size,
we show better performance against all other methods measured at
the same model size value. Our empirical analysis is complemented
by a theoretical performance guarantee which does not rely on any
stochastic assumption on the source generating the stream.¹

I. Introduction 

As pointed out in various papers (see, e.g., [?], [?]),
stream mining poses unique challenges to machine learning: 
examples must be efficiently processed one at a time as 
they arrive from the stream, and an up-to-date predictive 
model must be available at all times. Incremental learning 
systems are well suited to address these requirements: the key 
difference between a traditional (batch) learning system and an 
incremental one is that the latter learns by performing small 
adjustments to the current predictor. Each adjustment uses only 
the information provided by the current example in the stream, 
allowing an efficient and timely update of the predictive model. 
This is unlike batch learning, where training typically involves 
a costly global optimization process involving multiple passes 
over the data. 

Another important feature of stream mining is that the true 
structure of the problem is progressively revealed as more 
data are observed. In this context, nonparametric learning 
methods, such as decision trees or nearest neighbour (NN), 
are especially effective, as a nonparametric algorithm is not 
committed to any specific family of decision surfaces. For 
this reason, incremental algorithms for decision trees [?]
and nearest neighbour are extremely popular
in stream mining applications.

¹This paper is a longer version of the conference paper [6].

Since in nonparametric methods the model size keeps 
growing to fit the stream with increasing accuracy, we seek a 
method able to improve predictions while growing the model 
as slowly as possible. However, as the model size cannot grow 
unbounded, we also introduce a variant of our approach that 
prevents the model size from going beyond a given limit. In the 
presence of concept drift [?], bounding the model size
may actually improve the overall predictive accuracy, provided
the data points supporting the model are selected in the right
way.

A further issue in stream mining concerns the way prediction
methods are evaluated (see, e.g., [?] for a discussion). In
this paper, we advocate the use of the online error (also called
sequential risk, prequential risk, or prequential error [?]).
This quantity measures the average of the errors made by the
sequence of incrementally learned models, where one first tests 
the current model on the next example in the stream and then 
uses the same example to update the model. The sequential 
risk is therefore measured on each individual stream and does 
not specifically require stochastic assumptions on the way the 
stream is generated. 

In this paper, we propose a novel incremental and
nonparametric approach for the classification of data streams. We
present four different instances of our approach (called BASE,
BASE-ADJ, AUTO, and AUTO-ADJ) characterized by an
increasing degree of adaptivity to the data. In particular,
AUTO-ADJ is fully parameterless, a feature especially important in
streaming settings where tuning is a hard task. Even though 
our algorithms are instance-based like nearest neighbour, the 
learned models are significantly smaller than those produced 
by competing baselines and more accurate when the online 
performance is measured against the model size. Finally, 
our methods (except BASE) are natively multiclass and can 
dynamically accommodate new classes as they appear in the 
stream. 

In a nutshell, our algorithms work by incrementally covering
the input space with balls of possibly different radii. Each
new example that falls outside of the current cover becomes
the center of a new ball. Examples are classified according to
NN over the ball centers, where each ball predicts according
to the majority of the labels of previous examples that fell in
that ball. The set of balls is organized in a tree structure [?],
so that predictions can be computed in time logarithmic in the
number of balls. In order to increase the ability of the model



to fit new data, the radii of the balls shrink, thus making room 
for new balls. The shrinking of the radius may depend on time 
or, in the more sophisticated variants of our algorithms, on the 
number of classification mistakes made by each ball classifier. 
Similarly to decision trees, where leaves are split according to 
their impurity, our method locally adapts the complexity of the 
model by allocating more balls in regions of the input space 
where the stream is harder to predict. A further improvement 
concerns the relocation of the ball centers in the input space: as 
our methods are completely incremental, the positioning of the 
balls depends on the order of the examples in the stream, which 
may result in a model using more balls than necessary. In order 
to mitigate this phenomenon, while avoiding a costly global 
optimization step to reposition the balls, we also consider a 
variant in which a K-means step is used to move the center 
of a ball being updated towards the median of the data points 
that previously fell in that ball. A further modification which 
we consider is aimed at keeping the model size bounded even 
in the presence of an arbitrarily long stream. This is achieved 
by introducing a randomized mechanism for discarding balls 
when the size bound is reached. Specifically, the mechanism 
discards a ball with probability proportional to the mistake rate 
of the ball classifier. The underlying idea is to get rid of the
model parts that contribute the most to the global error and
may be replaced by a better arrangement of balls.
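
The randomized discarding step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names `Ball` and `discard_one` are hypothetical, each ball is assumed to keep a count of its mistakes and of the predictions it made, and the uniform fallback used when no ball has erred yet is our own choice rather than part of the method.

```python
import random
from dataclasses import dataclass


@dataclass
class Ball:
    """Hypothetical container for one local classifier."""
    center: tuple
    mistakes: int = 0
    predictions: int = 0

    def mistake_rate(self) -> float:
        return self.mistakes / self.predictions if self.predictions else 0.0


def discard_one(balls, rng):
    """Remove and return one ball, drawn with probability
    proportional to its mistake rate."""
    weights = [b.mistake_rate() for b in balls]
    if sum(weights) == 0.0:
        # Assumption: fall back to a uniform draw when no ball has erred.
        weights = [1.0] * len(balls)
    idx = rng.choices(range(len(balls)), weights=weights, k=1)[0]
    return balls.pop(idx)
```

Calling `discard_one(balls, random.Random(0))` whenever the size bound is hit keeps the model at a fixed size while preferentially freeing space in regions where the current balls predict poorly.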

In summary, we introduce a simple and flexible approach 
for nonparametric classification of data streams. Our approach 
is fully modular: we predict using majority voting, but a fully 
trainable classifier could be used instead. The simplest version 
of our approach, applicable to streams with binary labels, 
enjoys strong theoretical guarantees: its mistake rate on any
arbitrary stream converges to that of the best classification
function that satisfies a certain regularity condition. The more
complex versions of our approach learn multiclass classifiers
without knowing the number of distinct labels in advance. We
empirically show that our methods are excellent at trading off
classification accuracy against model size. Our most sophisticated
method is fully parameterless. Finally, we show that a simple
modification of our approach allows us to keep the model size
bounded, outperforming other methods measured at the same
value of model size.


The rest of the paper is organized as follows. Section II
discusses related work. In Section III we define the problem
setting. In Section IV we present our nonparametric
classification approach. In Section IV-A we discuss the theoretical
properties of our approach and derive a formal performance
guarantee for the simplest algorithm. We then introduce three
more sophisticated versions that are empirically more effective.
In Section VI we test the behaviour of our algorithms against
state-of-the-art baselines. In Section VII we introduce a simple
modification of our approach to keep the model size bounded.
Finally, Section VIII concludes the paper.


II. Related Work 

Within the vast area of stream mining [?], we focus our
analysis of related work on the subarea that is most relevant 
to this study: nonparametric methods for stream classification. 
The most important approaches in this domain are: 

Incremental decision and rule tree learning systems, such
as Very Fast Decision Tree (VFDT) [?] and Decision Rules
(RULES) [?], which use an incremental version of the split
function computation (see also [?]).

Incremental variants of NN, such as Condensed Nearest
Neighbour (CNN) [?], which stores only the misclassified
instances, Lazy-Tree (L-Tree) [28], which condenses historical
stream records into compact exemplars, and IBLStreams [23], an
instance-based learning algorithm that removes outliers or
examples that have become redundant.

Incremental kernel-based algorithms,² such as the kernel
Perceptron [?] with Gaussian kernels.

Note that our methods do not belong to any of the above 
three families: they do not perform a recursive partition of the 
feature space as decision trees, they do not allocate (or remove) 
instances based on the heuristics used by IBLStreams, and they 
do not use kernels. 

As we explain next, our most basic algorithm is a variant
for classification tasks of the algorithm proposed in [?] for
nonparametric regression in a streaming setting. A similar
algorithm was previously proposed in [?] and analyzed
without resorting to stochastic assumptions on the stream
generation. A preliminary instance of our approach, without
any theoretical analysis, was developed in [?] for an action
recognition application in video feeds.

III. Problem Setting 

Our analysis applies to streams of data points belonging to
an arbitrary metric space and depends on the metric dimension
of data points in the stream. This notion of dimension extends
to general metric spaces the traditional notions of dimension
(e.g., Euclidean dimension and manifold dimension) [?]. The
metric dimension of a subset $S$ of a metric space $(\mathcal{X}, \rho)$ is
$d$ if there exists a constant $C_S > 0$ such that, for all $\varepsilon > 0$,
$S$ has an $\varepsilon$-cover of size at most $C_S \varepsilon^{-d}$ (an $\varepsilon$-cover is a set
of balls of radius $\varepsilon$ whose union contains $S$). In practice, the
metric dimension of the stream may be much smaller than the
dimension of the ambient space $\mathcal{X}$. This is especially relevant
in the case of nonparametric algorithms, which typically scale
poorly with the dimensionality of the data. Note that
our algorithms do not require knowledge of $d$: the metric
dimension of the stream is automatically estimated from the
data.
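
To make the notion of metric dimension concrete, the following Python sketch builds a greedy $\varepsilon$-cover at two scales and reads off the exponent $d$ from the growth of the cover size. It is an illustration only: the helper names `greedy_cover` and `estimate_dimension` are hypothetical, the Euclidean metric is assumed, and this two-scale slope is not the estimator used by our algorithms.

```python
import math


def greedy_cover(points, eps):
    """Greedily pick centers so that every point lies within eps
    of some center (an eps-cover of the point set)."""
    centers = []
    for p in points:
        if all(math.dist(p, c) > eps for c in centers):
            centers.append(p)
    return centers


def estimate_dimension(points, eps_small, eps_large):
    """Slope of log(cover size) against log(1/eps) between two scales.
    If the cover size behaves like C * eps**(-d), the slope is close to d."""
    n_small = len(greedy_cover(points, eps_small))
    n_large = len(greedy_cover(points, eps_large))
    return math.log(n_small / n_large) / math.log(eps_large / eps_small)
```

For points lying on a line segment embedded in three-dimensional space, the estimate stays close to 1 even though the ambient dimension is 3, which is the situation the definition is meant to capture.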

The learner receives a sequence $(x_1, y_1), (x_2, y_2), \dots$ of
examples, where each data point $x_t \in \mathcal{X}$ is annotated with a
label $y_t$ from a set $\mathcal{Y} = \{1, \dots, K\}$ of possible class labels,
which may change over time. The learner's task is to predict
each label $y_t$ minimizing the overall number of prediction
mistakes over the data stream.

We derive theoretical performance guarantees for BASE,
the simplest algorithm in our family (Algorithm 2), without
making stochastic assumptions on the way the examples in
the stream are generated. Note that this is a very strong type 
of guarantee: our results hold on any individual stream of 
annotated data points. 

²Gaussian kernels are universal [?], meaning that a kernel-based model can
approximate any continuous classification function. Hence, algorithms using 
Gaussian kernels can be viewed as instance-based nonparametric learning 
algorithms. 











Algorithm 1 ABACOC TEMPLATE
Input: metric $\rho$

 1: Initialize set of ball centers $S = \emptyset$
 2: InitProcedure()
 3: for $t = 1, 2, \dots$ do
 4:    Get input example $(x_t, y_t)$
 5:    if $y_t \notin \mathcal{Y}$ then
 6:       Set $\mathcal{Y} = \mathcal{Y} \cup \{y_t\}$    // add new class on the fly
 7:    end if
 8:    Let $B(x_s, \varepsilon_s)$ be the ball in $S$ closest to $x_t$
 9:    OutputPrediction($B_s$)
10:    if $\rho(x_s, x_t) \le \varepsilon_s$ then
11:       $S$ = UpdateBallInformation($B_s$, $(x_t, y_t)$)
12:    else
13:       $S$ = AddNewBall($S$, $x_s$, $(x_t, y_t)$)
14:    end if
15:    UpdateEpsilon($S$)
16: end for
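
The control flow of Algorithm 1 can be sketched in Python as follows. This is a simplified illustration, not our implementation: the class name `BallCoverClassifier` is hypothetical, the radius is kept fixed (no UpdateEpsilon step), predictions use a plain majority vote, and the nearest ball is found by brute force rather than through a tree structure.

```python
import math
from collections import Counter


class BallCoverClassifier:
    """Illustrative fixed-radius ball cover with majority-vote predictions."""

    def __init__(self, eps, metric=math.dist):
        self.eps = eps          # common radius (assumption: never shrinks)
        self.metric = metric
        self.centers = []       # ball centers
        self.counts = []        # per-ball label counts

    def _nearest(self, x):
        # Brute-force nearest-center search over all balls.
        return min(range(len(self.centers)),
                   key=lambda s: self.metric(x, self.centers[s]))

    def predict(self, x):
        if not self.centers:
            return None
        s = self._nearest(x)
        return self.counts[s].most_common(1)[0][0]

    def update(self, x, y):
        if self.centers:
            s = self._nearest(x)
            if self.metric(x, self.centers[s]) <= self.eps:
                self.counts[s][y] += 1       # UpdateBallInformation
                return
        self.centers.append(x)               # AddNewBall
        self.counts.append(Counter([y]))
```

New labels need no special handling here because `Counter` creates a bin the first time a label is seen, mirroring lines 5 to 7 of the template.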


IV. Adaptive Ball Covering 

The adaptive ball covering at the root of our method was
previously used in a theoretical work [?]. Here, we distill
the main ideas behind that approach into a generic algorithmic
template (Algorithm 1) called ABACOC
(Adaptive BAll COver for Classification). We then present our
methods as specific instances of this generic template.

A. The BASE Algorithm 

Our first instance of ABACOC is BASE (Algorithm 2), a
randomized variant for binary classification of the ITBR
(Incremental Tree-Based Regressor) algorithm proposed in [?].
BASE shrinks the radius (line 28) of the balls depending on (1) 
an estimate of the metric dimension of the stream and (2) the 
number of data points so far observed from the stream. This 
implies that the radii of all the balls shrink at the same rate. In 
the prediction phase, the ball nearest to the input example is 
considered and a randomized binary prediction is made based 
on the class distribution estimate locally computed in the ball. 
Laplace estimators (line 5) and randomized predictions (lines 
6-8) are new features of BASE that were missing in ITBR. 

We now analyze the performance of BASE using the notion
of regret [?]. The regret of a randomized algorithm is defined
as the difference between the expected number of classification
mistakes made by the algorithm over the stream and the
expected number of mistakes made by the best element in
a fixed class of randomized classifiers. A randomized binary
classifier is a mapping $f : \mathcal{X} \to [0,1]$, where $f(x)$ is the
probability of predicting label $+1$. We consider the class $\mathcal{F}_L$
of $L$-Lipschitz predictors $f : \mathcal{X} \to [0,1]$ w.r.t. the metric $\rho$ of
the space. Namely,

$$\forall x, x' \in \mathcal{X}, \qquad |f(x) - f(x')| \le L\, \rho(x, x') .$$

Hence, a predictor is Lipschitz if, when we perturb the data
point $x$, the prediction changes by an amount at most linear in
the perturbation size. Lipschitz functions are a standard reference
in the analysis of nonparametric algorithms.

The regret of BASE generating randomized predictions $\hat{y}_t$
is defined by (see also [?])

$$R_L(T) = \sum_{t=1}^{T} P(\hat{y}_t \neq y_t) - \min_{f \in \mathcal{F}_L} \sum_{t=1}^{T} P(f(x_t) \neq y_t) .$$

For the BASE algorithm we can prove the following regret
bound against any Lipschitz randomized classifier, without any 
assumption on the way the stream is generated. Moreover, 
similarly to ITBR, the regret upper bound depends on the 
unknown metric dimension d of the space, automatically 
estimated by the algorithm. 


Theorem 1: Fix a metric $\rho$ and any stream $(x_t, y_t)$, $t =
1, \dots, T$, of binary labeled points $S = \{x_1, \dots, x_T\}$ in a
metric space $(\mathcal{X}, \rho)$ of diameter 1, and let $d$ be the metric
dimension of $S$. Assume that Algorithm 2 is run with
parameter $C \ge C_S$, where $C_S$ is such that $C_S \varepsilon^{-d}$ upper
bounds the size of any $\varepsilon$-cover of $S$. Then, for any $L > 0$
we have

$$R_L(T) \le 1.26 \left( 2.5 \sqrt{C 2^{d}} + 1.5 L \right) T^{\frac{1+d}{2+d}} .$$


The proof is in Section V. Note that the algorithm
does not know $L$, hence the regret bound above holds for all
values of $L$ simultaneously. This theorem tells us that BASE
is not a heuristic, but rather a principled approach with a
specific performance guarantee. The performance guarantee
implies that, on any stream, the expected mistake rate of BASE
converges to that of the best $L$-Lipschitz randomized classifier
at a rate of order $(\sqrt{C 2^{d}} + L)\, T^{-1/(2+d)}$.
Next, we generalize the BASE algorithm to multiclass
classification, and make some modifications aimed at improving
its empirical performance.

V. Proofs 

We use the following well-known fact: if $p_t = P(\hat{y}_t =
1)$ is the probability used to predict $y_t \in \{0,1\}$ with a
randomized label $\hat{y}_t \in \{0,1\}$, then $P(\hat{y}_t \neq y_t) = |y_t - p_t|$.

Even if our algorithm is different from ITBR, we can still
use the following lemma from the ITBR analysis [?]. In the
following, we say that a phase ends each time the condition in
line 15 of BASE is verified and use $T_i$ to denote the time
steps included in phase $i$. Finally, $|S_i|$ denotes the maximum
number of balls used in phase $i$.

Lemma 1: Suppose BASE is run with parameter
$C \ge C_S$. The following invariants hold throughout the
procedure for all phases $i \ge 1$:

• $i \le d_i \le d$.

• For any $t \in T_i$ we have $|S_i| \le C\, 2^{d_i}\, n_i^{d_i/(2+d_i)}$, where $n_i = |T_i|$.

Define $\ell_t(p_t) = |p_t - y_t|$. Unlike the analysis in [?], here we
cannot use a bias-variance decomposition. So, the key in the
proof is to decompose the regret into two terms whose behaviour is
similar to the bias and variance terms in the stochastic setting.







Algorithm 2 BASE

Input: C (space diameter)

 1: procedure InitProcedure
 2:    $S = \emptyset$, $i = 1$, $t_i = 0$, and $d_i = 1$
 3: end procedure
 4: procedure OutputPrediction($B_s$)
 5:    $q_s = \cdots$    ▷ Laplace estimator of counts
 6:    Set $p_t = 0$ if $q_s < \frac{1}{2} - \gamma_s$; $p_t = 1$ if $q_s > \frac{1}{2} + \gamma_s$;
 7:    $p_t = \frac{1}{2} + (q_s - \frac{1}{2}) / (2\gamma_s)$ otherwise.
 8:    Predict $\hat{y}_t = 1$ with probability $p_t$ and 0 otherwise.
 9: end procedure
10: procedure UpdateBallInformation($B_s$, $(x_t, y_t)$)
11:    $m_s = m_s + y_t$    ▷ number of $y_t = 1$ in the ball
12:    $n_s = n_s + 1$    ▷ total number of points in the ball
13: end procedure
14: procedure AddNewBall($S$, $x_s$, $(x_t, y_t)$)
15:    if $|S| + 1 > \cdots$ then    ▷ dimension check
16:       $S = \emptyset$    ▷ start Phase $i + 1$
17:       $d_{i+1} = \lceil \log(\cdots) / \log(2/\varepsilon_t) \rceil$
18:       $i = i + 1$
19:       $t_i = 0$
20:    end if
21:    $S = S \cup \{x_t\}$
22:    $m_t = y_t$    ▷ number of $y_t = 1$ in the ball
23:    $n_t = 1$    ▷ first point in the ball
24:    $t_i = t_i + 1$    ▷ counts the time steps within phase $i$
25: end procedure
26: procedure UpdateEpsilon
27:    // radius dependent on current time step
28:    $\varepsilon_t = t_i^{-1/(2+d_i)}$
29: end procedure


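Lemma 2: Let $d$ be the metric dimension of the set $S$ of
data points in the stream. Assume that $C \ge C_S$. Then, in any
phase $i$ and for any $f \in \mathcal{F}_L$ we have that

$$\sum_{t \in T_i} \bigl( \ell_t(p_t) - \ell_t(f(x_t)) \bigr) \le \left( 2.5 \sqrt{C 2^{d}} + 1.5 L \right) n_i^{\frac{1+d}{2+d}} .$$

Proof: We use the notation $x_t \to x_s$ to say that $x_t$ is
assigned to a ball with center $x_s$. We also denote by $n(x_s)$
the number of points assigned to a ball of center $x_s$. Define

$$p_s^* = \arg\min_{p \in [0,1]} \sum_{t : x_t \to x_s} \ell_t(p) .$$

For each $x_s$ in $S_i$, we proceed by upper bounding the error as
a sum of two components

$$\sum_{t : x_t \to x_s} \bigl( \ell_t(p_t) - \ell_t(f(x_t)) \bigr)
= \sum_{t : x_t \to x_s} \bigl( \ell_t(p_t) - \ell_t(p_s^*) \bigr)
+ \sum_{t : x_t \to x_s} \bigl( \ell_t(p_s^*) - \ell_t(f(x_t)) \bigr) .$$

Using the definition of $p_s^*$ and the Lipschitz property of $f$, we
have

$$\sum_{t : x_t \to x_s} \bigl( \ell_t(p_s^*) - \ell_t(f(x_t)) \bigr)
\le \sum_{t : x_t \to x_s} \bigl( \ell_t(f(x_s)) - \ell_t(f(x_t)) \bigr)
\le \sum_{t : x_t \to x_s} |f(x_s) - f(x_t)|
\le L \sum_{t : x_t \to x_s} \rho(x_s, x_t)
\le L \sum_{t : x_t \to x_s} \varepsilon_t .$$

The prediction strategy in each ball is equivalent to the
approach followed in [?] (see also Exercise 8.8 in [?]). The
only important thing to note is that the first prediction of the
algorithm in a ball is made using the probability of the closest
ball, even if it is further than $\varepsilon_t$, instead of at random as in
the original strategy in [?]. It is easy to see that this adds an
additional 0.5 to the regret stated in [?]. So we have

$$\sum_{t : x_t \to x_s} \bigl( \ell_t(p_t) - \ell_t(p_s^*) \bigr) \le \sqrt{n(x_s) + 1} + 1 \le 2.5 \sqrt{n(x_s)} .$$

Hence overall we have

$$\sum_{t : x_t \to x_s} \bigl( \ell_t(p_t) - \ell_t(f(x_t)) \bigr) \le 2.5 \sqrt{n(x_s)} + L \sum_{t : x_t \to x_s} \varepsilon_t .$$

Summing over all the $x_s \in S_i$, and using Jensen's inequality
in the second step, we have

$$\sum_{t \in T_i} \bigl( \ell_t(p_t) - \ell_t(f(x_t)) \bigr)
\le 2.5 \sum_{s=1}^{|S_i|} \sqrt{n(x_s)} + L \sum_{t \in T_i} \varepsilon_t
\le 2.5\, |S_i| \sqrt{\frac{1}{|S_i|} \sum_{s=1}^{|S_i|} n(x_s)} + L \sum_{t \in T_i} \varepsilon_t
= 2.5 \sqrt{|S_i|\, n_i} + L \sum_{t \in T_i} \varepsilon_t .$$

To bound $|S_i|$ we use Lemma 1, while to bound the last term,
we have

$$\sum_{t \in T_i} \varepsilon_t = \sum_{t=1}^{n_i} t^{-\frac{1}{2+d_i}}
\le \int_0^{n_i} \tau^{-\frac{1}{2+d_i}} \, d\tau
= \frac{2+d_i}{1+d_i}\, n_i^{\frac{1+d_i}{2+d_i}}
\le 1.5\, n_i^{\frac{1+d_i}{2+d_i}} ,$$

where $n_i = |T_i|$. Overall we have

$$\sum_{t \in T_i} \bigl( \ell_t(p_t) - \ell_t(f(x_t)) \bigr)
\le 2.5 \sqrt{C 2^{d_i}}\, n_i^{\frac{1+d_i}{2+d_i}} + 1.5 L\, n_i^{\frac{1+d_i}{2+d_i}}
= \left( 2.5 \sqrt{C 2^{d_i}} + 1.5 L \right) n_i^{\frac{1+d_i}{2+d_i}}
\le \left( 2.5 \sqrt{C 2^{d}} + 1.5 L \right) n_i^{\frac{1+d}{2+d}} .$$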


We finish with the proof of Theorem 1.

Proof: Let $I$ denote the number of phases up to time $T$.
Let $B = 2.5 \sqrt{C 2^{d}} + 1.5 L$. We use Lemma 2 in each phase
and sum over the phases, to have

$$\sum_{t=1}^{T} \bigl( \ell_t(p_t) - \ell_t(f(x_t)) \bigr)
= \sum_{i=1}^{I} \sum_{t \in T_i} \bigl( \ell_t(p_t) - \ell_t(f(x_t)) \bigr)
\le B \sum_{i=1}^{I} n_i^{\frac{1+d}{2+d}}
= B I \sum_{i=1}^{I} \frac{1}{I}\, n_i^{\frac{1+d}{2+d}}
\le B I \left( \sum_{i=1}^{I} \frac{1}{I}\, n_i \right)^{\frac{1+d}{2+d}}
= B\, I^{\frac{1}{2+d}}\, T^{\frac{1+d}{2+d}}
\le B\, d^{\frac{1}{2+d}}\, T^{\frac{1+d}{2+d}}
\le 1.26\, B\, T^{\frac{1+d}{2+d}} ,$$

where in the second inequality we use Jensen's inequality, and
in the second to last inequality the first statement of Lemma 1.


Algorithm 3 BASE-ADJ (BASE with ball adjustment)

Input: C (space diameter)

 1: procedure OutputPrediction($B_s$)
 2:    $n_s = n_s(1) + \cdots + n_s(K)$    ▷ total class counts
 3:    $p_s(k) = n_s(k) / n_s$,  $k = 1, \dots, K$
 4:    Predict $\hat{y}_t = \arg\max_{k \in \mathcal{Y}} p_s(k)$
 5: end procedure
 6: procedure UpdateBallInformation($B_s$, $(x_t, y_t)$)
 7:    // update ball centre on correct prediction
 8:    if $\hat{y}_t = y_t$ then
 9:       $\Delta = x_t - x_s$;  $n_s = n_s + 1$
10:       $x_s = x_s + \Delta / n_s$
11:    end if
12:    Update label counts $n_s(1), \dots, n_s(K)$ in the ball $B_s$ using $y_t$
13: end procedure


A. The BASE algorithm with ball adjustment 

A natural way of generalizing the BASE algorithm to the 
multiclass case is by estimating the class probabilities in each 
ball. Note that this approach is naturally incremental w.r.t. the 
number of classes: new bins for counting are created on the 
fly as data points of new classes arrive. 

Recall that the BASE algorithm greedily covers the input
space. In particular, balls are always centered on input points.
However, constraining the centers to data points is an
intuitively sub-optimal strategy: it might be possible to cover the
same region with a smaller number of balls if we could freely
move their centers. As a full optimization of the position of the
centers is not realistic in a streaming scenario, we introduce
the BASE-ADJ variant, which makes a partial optimization by
using a step of the K-means algorithm [?]. More precisely,
BASE-ADJ (Algorithm 3; only the main changes w.r.t. BASE
are shown) moves the center of each ball towards the average
of the correctly classified data points falling into it. In this way,
the center of the ball tends to move towards the centroid of
a cluster of points of a certain class. We expect this variant
to generate fewer balls and also to have a better empirical
performance.
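
The center adjustment is an incremental mean update, which the following Python sketch illustrates. The function name `adjust_center` is hypothetical, and points are represented as plain lists of floats for simplicity.

```python
def adjust_center(center, count, x):
    """Move `center` toward `x` by 1/count of the gap, so that the center
    remains the running mean of the points it has absorbed.

    `count` is the number of points averaged so far (1 for a new ball,
    whose center is its first point)."""
    count += 1
    new_center = [c + (xi - c) / count for c, xi in zip(center, x)]
    return new_center, count
```

Because each step divides by the updated count, later points move the center less and less, so a ball sitting in a stable single-class region gradually settles near the centroid of that region.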

We drop from BASE-ADJ the Laplace correction of class 
estimates and the randomization in the computation of the 
predicted label. Although these ingredients were used in the 
theoretical analysis, we noticed that they do not significantly
affect the empirical results. Hence, BASE-ADJ always predicts 
the class with the largest class probability estimate (majority 
voting on the collected labels) within the ball closest to the 
current data point. 

B. The AUTO algorithm: automatic radius 

One of the biggest issues with BASE (and ITBR) is the
use of a common radius for all the balls. In fact, in line 28
of Algorithm 2 we have that the radii $\varepsilon_s$ shrink uniformly
with time $t$ at rate $t^{-1/(2+d_t)}$, where $d_t$ is the estimated


Algorithm 4 AUTO and AUTO-ADJ

Input: d

 1: procedure InitProcedure
 2:    // wait until at least two different labels are fed
 3:    if $S = \emptyset$ then
 4:       $S = \{x_1\}$ and initialize label counts
 5:    else if $y_t \neq y_1$ then    ▷ $|S| = 1$
 6:       $S = S \cup \{x_t\}$, $\varepsilon_1 = \varepsilon_t = \rho(x_1, x_t)$
 7:       Initialize label counts
 8:    else
 9:       continue
10:    end if
11: end procedure
12: procedure OutputPrediction($B_s$)
13:    $n_s = n_s(1) + \cdots + n_s(K)$    ▷ total class counts
14:    $p_s(k) = n_s(k) / n_s$,  $k = 1, \dots, K$
15:    Predict $\hat{y}_t = \arg\max_{k \in \mathcal{Y}} p_s(k)$
16: end procedure
17: procedure UpdateBallInformation($B_s$, $(x_t, y_t)$)
18:    // shrink radius on errors
19:    if $\hat{y}_t \neq y_t$ then
20:       Set $m_s = m_s + 1$    ▷ update mistakes count
21:    else if AUTO-ADJ method then
22:       // update ball centre if correct prediction
23:       $\Delta = x_t - x_s$;  $u_s = u_s + 1$
24:       $x_s = x_s + \Delta / u_s$
25:    end if
26:    Update label counts $n_s(1), \dots, n_s(K)$ in the ball $B_s$ using $y_t$
27: end procedure
28: procedure AddNewBall($S$, $x_s$, $(x_t, y_t)$)
29:    $S = S \cup \{x_t\}$, $R_t = \rho(x_t, x_s)$
30:    $m_t = 0$    ▷ ball mistakes count
31:    $u_t = 1$    ▷ center updates count (for AUTO-ADJ)
32:    Initialize label counts $n_t(1), \dots, n_t(K)$ in the ball $B_t$ using $y_t$
33: end procedure
34: procedure UpdateEpsilon($B_s$)
35:    // radius dependent on mistakes
36:    $\varepsilon_s = R_s \cdots$
37: end procedure


metric dimension. However, we would like the algorithm to 
use smaller balls in regions of the input space where labels are 
more irregularly distributed and bigger balls in easy regions, 
where labels tend to be the same. 

In order to overcome this issue, in this section we introduce
two other instances of ABACOC: AUTO and AUTO-ADJ.
In these variants we let the radius of each ball shrink at a
rate depending on the number of mistakes made by each local
ball classifier (lines 20 and 36 in Algorithm 4). Moreover, in
order to get rid of the parameter C used to estimate the metric
dimension, we initialize the radius of each ball to the distance
to its closest ball (line 29 in Algorithm 4). In other words,
every time a new ball is added its radius is set equal to the
distance to the nearest already-existing ball.
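
The two ingredients above can be sketched in Python as follows. The radius initialization follows the description in the text. The shrink rule, in contrast, is only an assumption made for illustration: we take the initial radius times $(1 + m_s)^{-1/(2+d)}$, which mimics the time-based rate of BASE with the mistake count in place of time, and should not be read as the exact rule on line 36 of Algorithm 4. The function names are hypothetical and the Euclidean metric is assumed.

```python
import math


def initial_radius(x_new, centers):
    """Radius of a new ball: distance to the nearest existing center."""
    return min(math.dist(x_new, c) for c in centers)


def shrunk_radius(r_init, mistakes, d=2):
    """Radius after `mistakes` errors of the local classifier.

    ASSUMPTION: illustrative shrink rule r_init * (1 + m)^(-1/(2+d));
    the exact rule used by AUTO is not reproduced here."""
    return r_init * (1 + mistakes) ** (-1.0 / (2 + d))
```

Under this assumed rule a ball that never errs keeps its initial radius, while a ball in a hard region shrinks and leaves room for new balls nearby.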

AUTO-ADJ differs from AUTO because it implements 
the same strategy, introduced in BASE-ADJ, for updating the 
















position of the centers. Note that this strategy, coupled with 
the shrinkage depending on the number of mistakes, makes a 
ball stationary once it is covering a region of the space that 
contains data points always annotated with the same label. 

Using balls of different radii makes it impossible to work
with the automatic estimate of the metric dimension used in
BASE, BASE-ADJ and ITBR. For this reason, we further
simplify the algorithms by resorting to a fixed estimate of
the intrinsic dimension $d$, supplied as an input parameter.

VI. Experiments 

In this section, we describe baselines and datasets used in
the experiments and report on the obtained results. We
conducted an extensive evaluation on standard machine learning
datasets for the streaming setting. Generally, in real
applications for high-speed data streams, when the system cannot
afford to revise the current model after each observation of
a data point, stream sub-sampling is used to keep the model
size and the prediction efficiency under control. In order to
emphasize the distinctive features of our approaches (i.e., good
trade-off between accuracy and model size), we tested the
online (prequential) performance using sub-sampling (see
Algorithm 5). In this setting, the algorithms have access to
each true class label only with a certain probability. By varying
this probability, we can explore different model sizes for each
baseline algorithm and compare the resulting performances.
Note also that, while in this work we only consider random
sub-sampling, different and more active sampling schedules
could also be envisioned.

A. Baseline and datasets 

We considered eleven popular datasets for stream mining
listed in Table I.


Data          Cls   Dim   Examples    Drift   Source
sensor         54     5   2,219,803   no      SDMR
kddcup99       23    41     494,021   no      SDMR
powersupply    24     2      29,928   yes     SDMR
hyperPlane      5    10     100,000   yes     SDMR
sea             2     3      60,000   yes     DF
poker          10    10      25,010   no      MOA
covtype         7    54     581,012   yes     MOA
airlines        2   608     539,383   yes     MOA
electricity     2     8      45,312   yes     MOA
connect-4       3   126      67,557   no      LIBSVM
acoustic        3    50      78,823   no      LIBSVM


TABLE I. Datasets used for benchmarking. 


As indicated in the table, datasets are from the Stream
Data Mining repository (SDMR) [29], the Data Sets with
Concept Drift repository (DF) [25], the Massive Online Analysis
(MOA) collection³ and the LIBSVM classification
repository⁴. In all experiments, we measured the online accuracy
(prequential error in [13] or “Interleaved Test-Then-Train”
validation in MOA⁵). This is the average performance when
each new example in the stream is predicted using the classifier
trained only over the past examples in the stream (see
Algorithm 5, line 6).

³moa.cms.waikato.ac.nz/datasets/

⁴www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/

⁵moa.cms.waikato.ac.nz/


Algorithm 5 Online sub-sampling evaluation protocol

Input: rate, stream $(x_1, y_1), (x_2, y_2), \dots$

 1: Initialize online accuracy $M_0 = 0$
 2: for $t = 1, 2, \dots$ do
 3:    Receive instance $x_t$ from stream
 4:    Compute class label prediction $\hat{y}_t$
 5:    Receive true class label $y_t$
 6:    Update $M_t = (1 - \frac{1}{t}) M_{t-1} + \frac{1}{t} \mathbb{1}\{\hat{y}_t = y_t\}$
 7:    if rand() < rate then
 8:       Update model with new example $(x_t, y_t)$
 9:    end if
10: end for
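
The protocol of Algorithm 5 can be sketched in Python as follows. This is a minimal illustration: the function name `prequential_accuracy` is hypothetical, the model is assumed to expose `predict(x)` and `update(x, y)` methods, and `MajorityClass` is a toy stand-in model included only so that the sketch is self-contained.

```python
import random
from collections import Counter


class MajorityClass:
    """Toy model (assumption): always predicts the most frequent label seen."""

    def __init__(self):
        self.counts = Counter()

    def predict(self, x):
        return self.counts.most_common(1)[0][0] if self.counts else None

    def update(self, x, y):
        self.counts[y] += 1


def prequential_accuracy(stream, model, rate, seed=0):
    """Test-then-train evaluation with random label sub-sampling."""
    rng = random.Random(seed)
    acc = 0.0
    for t, (x, y) in enumerate(stream, start=1):
        y_hat = model.predict(x)                 # test on the new example first
        acc = (1 - 1 / t) * acc + (1 / t) * (y_hat == y)  # running mean (line 6)
        if rng.random() < rate:                  # label revealed with prob. rate
            model.update(x, y)                   # then train on it
    return acc
```

Every example contributes to the accuracy estimate, but only a `rate` fraction of them (in expectation) is used for training, which is how different model sizes are explored.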


In a pre-processing phase, the categorical attributes were
binarized. BASE and BASE-ADJ received normalized input
instances (Euclidean norm), allowing the input parameter C
(space diameter) to be set to 1. We compared our ABACOC
methods BASE⁶ (Algorithm 2), BASE-ADJ (Algorithm 3),
AUTO and AUTO-ADJ (Algorithm 4) against some of the
most popular incremental nonparametric baselines (see
Section II) in the stream mining literature: K-NN with parameter
K = 3 (NN3) (see next paragraph for a justification of
this choice), Condensed Nearest Neighbor [?] (CNN), a
streaming version of NN which only stores mistaken points,
the multiclass Perceptron with Gaussian kernel [?] (K-PERC),
a decision tree algorithm for streaming data [?] (VFDT), and
a recent algorithm for learning decision rules on streaming
data [?] (RULES). For VFDT and RULES we used the
implementation available in MOA, while K-PERC was run
using the code in DOGMA [?]. The ABACOC algorithms
were implemented in MATLAB⁷. We did not consider the
L-Tree [28] and IBLStreams [23] methods described in Section II,
as L-Tree is an efficient approximation of NN (outperformed
by NN, see [28]) and IBLStreams never performs better than
RULES (both implemented in MOA) on our datasets.

Where necessary, the parameters of the competitor methods 
were individually tuned on each dataset using an algorithm- 
specific grid of values in order to obtain the best online 
performance. Hence, the results of the competitors are not 
worse than the ones obtainable with a tuning of the parameters 
using standard cross-validation methods. For our methods, we 
used the Euclidean distance as metric $\rho$. Based on preliminary
experiments, we noticed that the parameter d does not
significantly affect the performance of AUTO and AUTO-ADJ, so
we set it to 2. With d fixed to this value, our methods are
essentially parameterless, which is a very attractive feature in
a streaming setting where cross-validation cannot be easily
applied.

B. Comparison among our methods 

First, we compared the empirical behaviour of all our
algorithms on the two-dimensional dataset banana⁸ in Figure 1.
The simplicity of this dataset allows us to show visually the
difference between the four algorithms. BASE is seen to have
many overlapping balls. On the other hand, AUTO has balls


⁶We used the multiclass version as for BASE-ADJ.

⁷Code available at http://mloss.org/software/view/560/

⁸http://mldata.org/repository/data/viewslug/banana-ida/













(a) BASE (b) BASE-ADJ (c) AUTO (d) AUTO-ADJ 


Fig. 1. Empirical behaviours of all versions of ABACOC algorithm on 2000 datapoints of the banana dataset. The intensity of the colour of each ball is 
proportional to the conditional class probability of the two classes. 
Fig. 2. Model size and online performance averaged over all datasets in Table I of our four methods. Performances are computed by normalizing each
performance relative to the best performer for each dataset, and then averaging over the datasets.


of different radii that overlap much less. Finally, BASE-ADJ 
and AUTO-ADJ, the variants of BASE and AUTO that update 
the centers of the balls, have a smaller number of balls than 
BASE and AUTO respectively. Also, note how the use of a 
varying shrinking radius in AUTO and AUTO-ADJ results in 
bigger balls that cover very large regions of the space. To verify 
the intuition that emerged from Figure 1, we empirically tested 
the performance of our methods on the entire benchmark of 
Table I, running Algorithm 5 with rate = 1. In Figure 2(a), 
we show the resulting model sizes, expressed as the percentage 
of the stream length used to represent the models (fraction of 
input samples used as ball centers), of each method averaged 
over all datasets in our benchmark suite. Figure 2(b) shows the 
average normalized accuracy of each method as a fraction of 
the accuracy of the best-performing method on each dataset. 
Note that, due to the adjustment procedure added to BASE-ADJ 
and AUTO-ADJ, they use a small fraction of data to 
represent their models while achieving a performance better 
than, respectively, BASE and AUTO. Finally, we observe that 
AUTO-ADJ simultaneously achieves the smallest model and 
the best performance. 
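The normalization behind the averaged accuracy (each performance divided by the best performer on that dataset, then averaged over the datasets) can be sketched as follows. The function name `normalized_average` and the nested-dictionary layout are assumptions made for illustration, and the numbers in the usage note are made-up toy values, not results from the paper.

```python
def normalized_average(performance):
    """Average per-dataset performance relative to the best method on each dataset.

    performance[dataset][method] holds a raw online performance value.
    Assumes every method has a score on every dataset.
    """
    methods = sorted({m for scores in performance.values() for m in scores})
    totals = dict.fromkeys(methods, 0.0)
    for scores in performance.values():
        best = max(scores.values())  # top performer on this dataset
        for m in methods:
            totals[m] += scores[m] / best
    return {m: totals[m] / len(performance) for m in methods}
```

For instance, with toy inputs `{"d1": {"A": 0.5, "B": 1.0}, "d2": {"A": 0.8, "B": 0.2}}`, method A scores (0.5 + 1.0) / 2 = 0.75 and method B scores (1.0 + 0.25) / 2 = 0.625, so a method can rank first overall without being the best on every dataset.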

C. Comparison against baselines 

We now turn to describing the sub-sampling experiments. 
In a streaming setting, the model size, and thus the 
computational efficiency of the prediction system, is a key feature. 
The goal of the experiments is to show the trade-off between 
online performance and model size for each algorithm. The 
Fig. 3. Online performance against model size averaged over the datasets. 
The model size is relative to the stream length, whereas the online performance 
is measured relative to the top-performing method on each dataset without 
restriction on model size. 


model size is measured by the number of balls used to cover 
the feature space (ABACOC), the number of stored instances 
(K-PERC, NN, CNN), or the number of leaves (VFDT) or rules 
(RULES) used to partition the feature space. 








We ran all the methods using values rate = 
{1%, 3%, 5%, 10%} and the same random seeds for all 
algorithms. In Figure 3 we plot the normalized online 
performance against model size, averaged over the datasets. The 
model size is relative to the stream length, whereas the online 
performance is measured relative to the top-performing method 
on each dataset without restriction on model size. As we can 
see from the plot, NN3 saturates the model size and achieves 
a slightly better overall performance on the larger model sizes. 
However, it suffers at low budget values and small model 
sizes. CNN works better than K-PERC and decision trees. 
VFDT and RULES use very little memory but have a worse 
performance than the other methods. BASE-ADJ improves on 
the performance of BASE. AUTO attains a better performance 
than BASE, and AUTO-ADJ achieves the overall best trade-off 
between accuracy and model size. In fact, as we can see 
in Figure 3, the AUTO-ADJ curve dominates the other ones. 
Moreover, it attains 90% of the performance of the best 
full-sampling methods while using only 1.5% of the data to 
represent the model. 
Because of the better performance exhibited by our methods 
with respect to the baselines at the same model size values, 
we can infer that our methods have a better way of choosing 
the data points that define their models. 


a constant model size bound. With respect to sub-sampling, 
here the algorithm has more control over the data points that 
support the model. We report in Table II and in Table III the 
performance with budget 10% and 1% of the method AUTO-ADJ 
with constant budget, called AUTO-ADJ FIX, compared 
to NN3 and AUTO-ADJ, which performed the best in the 
previous experiments, using the same final model sizes. As 


Data          NN3           AUTO-ADJ      AUTO-ADJ FIX
kddcup99      .714 | .100   .614 | .010   .792 | .069
poker         .677 | .100   .710 | .003   .719 | .036
connect-4     .592 | .100   .605 | .011   .635 | .026
acoustic      .348 | .100   .352 | .003   .353 | .023
sensor        .680 | .100   .667 | .009   .748 | .075
hyperPlane    .416 | .100   .385 | .028   .417 | .100
electricity   .295 | .100   .266 | .011   .530 | .093
powersupply   .650 | .100   .630 | .020   .653 | .099
airlines      .682 | .100   .654 | .027   .641 | .100
sea           .502 | .100   .489 | .021   .502 | .100
covtype       .956 | .100   .980 | .001   .979 | .001


TABLE II. SUMMARY OF THE ONLINE PERFORMANCE (LEFT) AND 
MODEL SIZE (RIGHT) ON THE FULL BENCHMARK SUITE OF THE BEST 
THREE ALGORITHMS RUN WITH BUDGET 10% OF THE TOTAL STREAM 
LENGTH (MODEL SIZE IS ALSO EXPRESSED AS A FRACTION OF THE 
STREAM LENGTH). 


VII. Constant model size 


In this section we propose a simple method for making 
the memory footprint bounded, even in the presence of an 
arbitrarily long data stream. When the model size reaches 
a given limit, the algorithm starts to discard the examples 
supporting the model that are judged to be less informative 
for the prediction task. More precisely, it is reasonable to 
discard the local classifiers that are making the largest number 
of mistakes. This happens essentially for two reasons: 1) the 
optimal decision surface in that region is complex and/or 
the noise rate is high; 2) there is concept drift [26], that is, 
the optimal decision surface is locally changing over time. 
Removing local classifiers with a high mistake rate may then 
help because: we are discarding classifiers that are making 
essentially random decisions; moreover, we make room for 
new classifiers that rely on fresh statistics (good in case of 
concept drift) and are possibly better positioned to capture a 
complex decision surface. Thus, in order to curb the memory 
footprint, we propose a simple approach based on deleting 
existing balls whenever a given budget parameter is attained. 
This is crucial for real-time applications, as NN search in the 
prediction phase is logarithmic in the number of balls. The 
probability of deleting any given ball is proportional to the 
number of mistakes made so far by the associated classifier. 
Namely, after the budget is reached, whenever a new ball is 
added, an existing ball i is discarded according to the 
Laplace-corrected probability 

    P(i \text{ discarded}) = \frac{m_i + 1}{\sum_{j \in S} (m_j + 1)}        (1)


where m_i is the number of mistakes made by ball i ∈ S. 
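The discard rule can be sketched in a few lines. This is a hedged illustration rather than the authors' implementation: the function names, the list-based storage of balls and mistake counts, and the choice to evict an existing ball before appending the new one are all assumptions, and the normalization simply sums the Laplace-corrected counts m_j + 1 over the current balls.

```python
import random


def discard_probabilities(mistakes):
    """Laplace-corrected probabilities: (m_i + 1) / sum_j (m_j + 1)."""
    total = sum(m + 1 for m in mistakes)
    return [(m + 1) / total for m in mistakes]


def add_ball_with_budget(balls, mistakes, new_ball, budget, rng=random):
    """Append new_ball; if the budget is already reached, evict one old ball first.

    balls and mistakes are parallel lists that are modified in place.
    """
    if len(balls) >= budget:
        # random.choices normalizes the weights, so the unnormalized m_i + 1 suffice
        victim = rng.choices(range(len(balls)), weights=[m + 1 for m in mistakes])[0]
        del balls[victim]
        del mistakes[victim]
    balls.append(new_ball)
    mistakes.append(0)  # a fresh ball starts with no recorded mistakes
```

With mistake counts `[0, 1, 3]`, the probabilities are 1/7, 2/7 and 4/7: a ball with no mistakes can still be evicted, but a ball with many mistakes is evicted far more often.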
We ran the experiments in the same setting of Section VI-C, 
where we did not make any restriction on the sub-sampling 
rate (rate = 1 in Algorithm 5). We added to AUTO-ADJ 


®We remark that the rate is only an upper bound on the model size. In 
fact, the methods can select a smaller fraction of data to represent the model. 


Data          NN3           AUTO-ADJ      AUTO-ADJ FIX
kddcup99      .550 | .010   .501 | .001   .654 | .009
poker         .674 | .010   .691 | .001   .710 | .010
connect-4     .575 | .010   .590 | .003   .603 | .010
acoustic      .345 | .010   .347 | .001   .349 | .009
sensor        .614 | .010   .620 | .001   .759 | .009
hyperPlane    .391 | .010   .361 | .003   .427 | .010
electricity   .130 | .010   .120 | .001   .621 | .010
powersupply   .609 | .010   .586 | .001   .622 | .009
airlines      .634 | .010   .590 | .003   .668 | .010
sea           .456 | .010   .462 | .002   .473 | .009
covtype       .945 | .010   .975 | .001   .979 | .001


TABLE III. SUMMARY OF THE ONLINE PERFORMANCE (LEFT) AND 
MODEL SIZE (RIGHT) ON THE FULL BENCHMARK SUITE OF THE THREE 
BEST ALGORITHMS RUN WITH BUDGET 1% OF THE TOTAL STREAM 
LENGTH (MODEL SIZE IS ALSO EXPRESSED AS A FRACTION OF THE 
STREAM LENGTH). 


we can observe from these tables, AUTO-ADJ FIX generally 
outperforms the other methods at the same model sizes. This 
is very evident on the datasets with drift, such as electricity, 
and when the budget limit is very small (1% of the total 
stream length). Along the same lines of Figure 3, we show 
in Figure 4 the overall performance of the compared methods 
using all the budget/rate values {1%, 3%, 5%, 10%}. 
AUTO-ADJ FIX clearly outperforms all the other methods. This is not 
surprising, as AUTO-ADJ FIX has a better way of choosing 
the data points supporting the model as opposed to the random 
selection imposed on the other methods. 

VIII. Conclusion and Future Works 

We presented an intuitive and easy-to-implement approach 
for nonparametric classification of data streams. Our more 
sophisticated algorithms feature the most appealing traits in 
stream mining applications: nonparametric classification, 
incremental learning, dynamic addition of new classes, small 
model size, fast prediction at testing time (logarithmic in the 
model size), and essentially no parameters to tune. We empirically 
Fig. 4. Online performance against model size, averaged over the datasets. 
The model size is relative to the stream length, whereas the online performance 
is measured relative to the top-performing method on each dataset without 
restriction on model size. 


showed the effectiveness of our approach in different scenarios 
and against several standard baselines. In addition, we proved 
strong theoretical guarantees on the online performance of the 
most basic version of our approach. 

Further research will focus on finding a confidence measure 
for the prediction scores, which could be used in a 
semi-supervised framework (e.g., active learning). Another 
interesting line of research is concerned with finding a more 
sophisticated and theoretically justified strategy to keep the model size 
bounded. A further, very challenging research line is in the 
direction of taming the curse of dimensionality problem that 
affects all nonparametric approaches. For instance, we plan on 
investigating notions of local dimension that allow 
dimensionality reduction to be performed locally and incrementally. 